Abstract

We study 4 problems in string matching, namely, regular expression matching, approximate 
regular expression matching, string edit distance, and subsequence indexing, on a standard word 
RAM model of computation that allows logarithmic-sized words to be manipulated in constant 
time. We show how to improve the space and/or remove a dependency on the alphabet size 
for each problem using either an improved tabulation technique of an existing algorithm or by 
combining known algorithms in a new way. 

Keywords: Regular Expression Matching; Approximate Regular Expression Matching; String
Edit Distance; Subsequence Indexing; Four Russian Technique. 

1 Introduction 

We study 4 problems in string matching on a standard word RAM model of computation that 
allows logarithmic-sized words to be manipulated in constant time. This model is often called the
transdichotomous model. We show how to improve the space and/or remove a dependency on the
alphabet size for each problem. Three of the results are obtained by improving the tabulation of
subproblems within an existing algorithm. The idea of using tabulation to improve algorithms is
often referred to as the Four Russian Technique after Arlazarov et al. [1] who introduced it for
boolean matrix multiplication. The last result is based on a new combination of known algorithms. 
The problems and our results are presented below. 

Regular Expression Matching Given a regular expression R and a string Q, the Regular
Expression Matching problem is to determine if Q is a member of the language denoted by R.
This problem occurs in several text processing applications, such as in editors like Emacs [24] or in
the Grep utilities [30, 21]. It is also used in the lexical analysis phase of compilers and interpreters,
where regular expressions are commonly used to match tokens for the syntax analysis phase, and more
recently for querying and validating XML databases, see e.g., [12, 13, 16, 6]. The standard textbook
solution to the problem, due to Thompson [25], constructs a non-deterministic finite automaton
(NFA) for R and simulates it on the string Q. For R and Q of sizes m and n, respectively,
this algorithm uses $O(mn)$ time and $O(m)$ space. If the NFA is converted into a deterministic
finite automaton (DFA), the DFA needs $O(\frac{m}{w} 2^{2m} \sigma)$ words, where $\sigma$ is the size of the alphabet
$\Sigma$ and $w$ is the word size. Using clever representations of the DFA the space can be reduced to
$O(\frac{m}{w}(2^{m} + \sigma))$ [31, 23]. Efficient average case algorithms were given by Baeza-Yates and Gonnet [3].

Normally, it is reported that the running time of traversing the DFA is $O(n)$, but this complexity
analysis ignores the word size. Since nodes in the DFA may need $\Omega(m)$ bits to be addressed, we
may need $\Omega(m/w + 1)$ time to identify the next node in the traversal. Therefore the running
time becomes $O(mn/w + n + m)$ with a potential exponential blowup in the space. Hence, in the
transdichotomous model, where $w$ is $O(\log(n + m))$, using worst-case exponential preprocessing
time improves the query time by a log factor.

The fastest known algorithm is due to Myers [17], who showed how to achieve $O(mn/k + m2^k +
(n + m)\log m)$ time and $O(2^k m)$ space, for any $k \le w$. In particular, for $k = \epsilon \log n$, for constant
$0 < \epsilon < 1$, this gives an algorithm using $O(mn/\log n + (n + m)\log m)$ time and $O(mn^{\epsilon})$ space.

In Section 2 we present an algorithm for Regular Expression Matching that takes
$O(nm/k + n + m\log m)$ time and uses $O(2^k + m)$ space, for any $k \le w$. In particular, if we pick
$k = \epsilon \log n$, for constant $0 < \epsilon < 1$, we are (at least) as fast as the algorithm of Myers, while
achieving $O(n^{\epsilon} + m)$ space.

We note that for large word sizes ($w > \log^2 n$) one of the authors has recently devised an even
faster algorithm using very different ideas [7]. This research was done after the work that led to 
the results in this paper. 

Approximate Regular Expression Matching Motivated by applications in computational
biology, Myers and Miller [18] studied the Approximate Regular Expression Matching problem.
Here, we want to determine if Q is within edit distance d to any string in the language given by
R. The edit distance between two strings is the minimum number of insertions, deletions, and
substitutions needed to transform one string into the other. Myers and Miller [18] gave an $O(mn)$ time
and $O(m)$ space dynamic programming algorithm. Subsequently, assuming a constant-sized
alphabet, Wu, Manber and Myers [32] gave an $O(\frac{mn\log(d+2)}{\log n} + n + m)$ time and
$O(\frac{m\sqrt{n}\log(d+2)}{\log n} + n + m)$ space algorithm. Recently, an exponential space solution
based on DFAs for the problem has been proposed by Navarro [22].

In Section 3 we extend our results of Section 2 and give an algorithm, without any assumption
on the alphabet size, using $O(\frac{mn\log(d+2)}{k} + n + m\log m)$ time and $O(2^k + m)$ space, for any $k \le w$.

String Edit Distance We conclude by giving a simple way to improve the complexity of the
String Edit Distance problem, which is defined as that of computing the minimum number of
edit operations needed to transform a given string S of length m into a given string T of length n. The
standard dynamic programming solution to this problem uses $O(mn)$ time and $O(\min(m, n))$ space.
The fastest algorithm for this problem, due to Masek and Paterson [13], achieves $O(mn/k^2 + m + n)$
time and $O(2^k + \min(n, m))$ space for any $k \le w$. However, this algorithm assumes a constant size
alphabet. For long word sizes faster algorithms can be obtained [19, 5]. See also the survey by
Navarro [20].
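The standard dynamic program mentioned above can be sketched as follows. This is a minimal illustration of the textbook $O(mn)$ time, $O(\min(m, n))$ space solution only, not of the tabulated algorithms discussed in this paper; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s: str, t: str) -> int:
    """Textbook edit distance keeping only one row of the DP table."""
    if len(s) < len(t):
        s, t = t, s                      # make t the shorter string
    prev = list(range(len(t) + 1))       # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        cur = [i]                        # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]
```

Only the previous row is kept, and the rows are indexed by the shorter string, which is where the $\min(m, n)$ in the space bound comes from.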

In Section 4 we show how to achieve $O(nm \log^2 k/k^2 + m + n)$ time and $O(2^k + \min(n, m))$
space for any $k \le w$ for an arbitrary alphabet. Hence, we remove the dependency on the alphabet
size at the cost of a $\log^2 k$ factor in the running time.

Subsequence Indexing We also consider a special case of regular expression matching. Given a
text T, the Subsequence Indexing problem is to preprocess T to allow queries of the form
"is Q a subsequence of T?" Baeza-Yates [3] showed that this problem can be solved with $O(n)$
preprocessing time and space, and query time $O(m \log n)$, where Q has length m and T has length
n. Conversely, one can achieve queries of time $O(m)$ with $O(n\sigma)$ preprocessing time and space. As
before, $\sigma$ is the size of the alphabet.
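One common way to obtain bounds of the first kind (linear preprocessing, $O(m \log n)$ query) is to store, for each character, the sorted list of its positions in T and answer a query by repeated binary search. The sketch below assumes this approach; the names `build_index` and `is_subsequence` are illustrative and not taken from the cited work.

```python
from bisect import bisect_right
from collections import defaultdict

def build_index(text):
    """Map each character to the increasing list of its positions in text."""
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)
    return dict(positions)

def is_subsequence(index, query):
    """Greedily match query left to right, one binary search per character."""
    pos = -1                              # position of the last matched character
    for ch in query:
        occ = index.get(ch)
        if occ is None:
            return False                  # character never occurs in the text
        k = bisect_right(occ, pos)        # first occurrence strictly after pos
        if k == len(occ):
            return False
        pos = occ[k]
    return True
```

For example, after `idx = build_index("abracadabra")`, the query `is_subsequence(idx, "rcb")` succeeds, while `"dc"` fails because the only `c` precedes the only `d`.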

In Section 5 we give an algorithm that improves the former result to $O(m \log\log \sigma)$ query time
or the latter result to $O(n\sigma^{\epsilon})$ preprocessing time and space.

2 Regular Expression Matching 

Given a string Q and a regular expression R the Regular Expression Matching problem
is to determine if Q is in the language given by R. Let n and m be the sizes of Q and R,
respectively. In this section we show that Regular Expression Matching can be solved in
$O(mn/k + n + m\log m)$ time and $O(2^k + m)$ space, for $k \le w$.

2.1 Regular Expressions and NFAs 

We briefly review Thompson's construction and the standard node set simulation. The set of regular
expressions over $\Sigma$ is defined recursively as follows:

• A character $a \in \Sigma$ is a regular expression.

• If S and T are regular expressions then so is the catenation, $(S) \cdot (T)$, the union, $(S)|(T)$, and
the star, $(S)^*$.

Unnecessary parentheses can be removed by observing that $\cdot$ and $|$ are associative and by using the
standard precedence of the operators, that is $*$ precedes $\cdot$, which in turn precedes $|$. Furthermore,
we will often remove the $\cdot$ when writing regular expressions. The language L(R) generated by
R is the set of all strings matching R. The parse tree T(R) of R is the rooted and ordered tree
representing the hierarchical structure of R. All leaves are represented by a character in $\Sigma$ and all
internal nodes are labeled $\cdot$, $|$, or $*$. We assume that parse trees are binary and constructed such
that they are in one-to-one correspondence with the regular expressions. An example parse tree of
the regular expression $ac|a^*b$ is shown in Fig. 2(a).

A finite automaton A is a tuple $A = (G, \Sigma, \theta, \Phi)$ such that,

• G is a directed graph,

• Each edge $e \in E(G)$ is labeled with a character $a \in \Sigma$ or $\epsilon$,

• $\theta \in V(G)$ is a start node,

• $\Phi \subseteq V(G)$ is the set of accepting nodes.

A is a deterministic finite automaton (DFA) if A does not contain any $\epsilon$-edges, and for each node
$v \in V(G)$ all outgoing edges have different labels. Otherwise, A is a non-deterministic automaton
(NFA). We say that A accepts a string Q if there is a path from $\theta$ to a node in $\Phi$ which spells out
Q.

Using Thompson's method [25] we can recursively construct an NFA N(R) accepting all strings
in L(R). The set of rules is presented below and illustrated in Fig. 1.

Figure 1: Thompson's NFA construction. The regular expression for a character $a \in \Sigma$ corresponds
to NFA (a). If S and T are regular expressions then N(ST), N(S|T), and N(S*) correspond to
NFAs (b), (c), and (d), respectively. Accepting nodes are marked with a double circle.

• N(a) is the automaton consisting of a start node $\theta_a$, accepting node $\phi_a$, and an a-edge from
$\theta_a$ to $\phi_a$.

• Let N(S) and N(T) be automata for regular expressions S and T with start and accepting
nodes $\theta_S$, $\theta_T$, $\phi_S$, and $\phi_T$, respectively. Then, NFAs for $N(S \cdot T)$, $N(S|T)$, and $N(S^*)$ are
constructed as follows:

N(ST): Merge the nodes $\phi_S$ and $\theta_T$ into a single node. The new start node is $\theta_S$ and the
new accepting node is $\phi_T$.

N(S|T): Add a new start node $\theta_{S|T}$ and new accepting node $\phi_{S|T}$. Then, add $\epsilon$-edges from
$\theta_{S|T}$ to $\theta_S$ and $\theta_T$, and from $\phi_S$ and $\phi_T$ to $\phi_{S|T}$.

N(S*): Add a new start node $\theta_{S^*}$ and new accepting node $\phi_{S^*}$. Then, add $\epsilon$-edges from $\theta_{S^*}$
to $\theta_S$ and $\phi_{S^*}$, and from $\phi_S$ to $\phi_{S^*}$ and $\theta_S$.

By construction, N(R) has a single start and accepting node, denoted $\theta$ and $\phi$, respectively. $\theta$
has no incoming edges and $\phi$ has no outgoing edges. The total number of nodes is at most 2m
and since each node has at most 2 outgoing edges the total number of edges is less than 4m.
Furthermore, all incoming edges of a node have the same label, and we call a node with incoming
a-edges an a-node. Note that the star construction in Fig. 1(d) introduces an edge from the accepting
node of N(S) to the start node of N(S). All such edges in N(R) are called back edges and all other
edges are forward edges. We need the following important property of N(R).

Lemma 1 (Myers [17]) Any cycle-free path in N(R) contains at most one back edge.

For a string Q of length n the standard node-set simulation of N(R) on Q produces a sequence of
node-sets $S_0, \ldots, S_n$. A node v is in $S_i$ iff there is a path from $\theta$ to v that spells out the ith prefix of
Q. The simulation can be implemented with the following simple operations. Let S be a node-set
in N(R) and let a be a character in $\Sigma$.

Move(S, a): Compute and return the set of nodes reachable from S via a single a-edge.
Close(S): Compute and return the set of nodes reachable from S via 0 or more $\epsilon$-edges.

The number of nodes and edges in N(R) is $O(m)$, and both operations are implementable in $O(m)$
time. The simulation proceeds as follows: Initially, $S_0 := \mathrm{Close}(\{\theta\})$. If $Q[j] = a$, $1 \le j \le n$, then
$S_j := \mathrm{Close}(\mathrm{Move}(S_{j-1}, a))$. Finally, $Q \in L(R)$ iff $\phi \in S_n$. Since each node-set $S_j$ only depends on
$S_{j-1}$ this algorithm uses $O(mn)$ time and $O(m)$ space.
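The construction rules and the node-set simulation can be sketched together as follows. This is a minimal illustration under stated assumptions: the parse tree is given as nested tuples (a hypothetical input format, not from the paper), all function names are illustrative, and Close scans the whole edge list for brevity, so it does not meet the $O(m)$ bound that adjacency lists would give.

```python
from itertools import count

def thompson(tree):
    """Build N(R) from a parse tree: a character 'a', ('cat', S, T),
    ('alt', S, T) or ('star', S). Returns (edges, theta, phi)."""
    ids = count()
    edges = []                       # (source, label, target); label None = epsilon

    def build(t, theta):
        """Attach the automaton for t at start node theta; return its accepting node."""
        if isinstance(t, str):
            phi = next(ids)
            edges.append((theta, t, phi))
            return phi
        if t[0] == 'cat':            # accepting node of S is merged with start node of T
            return build(t[2], build(t[1], theta))
        if t[0] == 'alt':
            th_s, th_t = next(ids), next(ids)
            edges.extend([(theta, None, th_s), (theta, None, th_t)])
            ph_s, ph_t = build(t[1], th_s), build(t[2], th_t)
            phi = next(ids)
            edges.extend([(ph_s, None, phi), (ph_t, None, phi)])
            return phi
        if t[0] == 'star':
            th_s = next(ids)
            ph_s = build(t[1], th_s)
            phi = next(ids)
            edges.extend([(theta, None, th_s), (theta, None, phi),
                          (ph_s, None, phi), (ph_s, None, th_s)])  # last one is the back edge
            return phi
        raise ValueError(f"unknown parse tree node: {t!r}")

    theta = next(ids)
    return edges, theta, build(tree, theta)

def move(edges, S, a):
    """Nodes reachable from S via a single a-edge."""
    return {v for (u, label, v) in edges if label == a and u in S}

def close(edges, S):
    """Nodes reachable from S via 0 or more epsilon-edges."""
    result, stack = set(S), list(S)
    while stack:
        u = stack.pop()
        for (src, label, v) in edges:
            if src == u and label is None and v not in result:
                result.add(v)
                stack.append(v)
    return result

def matches(tree, Q):
    """Standard node-set simulation of N(R) on Q."""
    edges, theta, phi = thompson(tree)
    S = close(edges, {theta})
    for a in Q:
        S = close(edges, move(edges, S, a))
    return phi in S
```

For the running example $ac|a^*b$, the tree is `('alt', ('cat', 'a', 'c'), ('cat', ('star', 'a'), 'b'))`; `matches` accepts `"ac"`, `"b"` and `"aaab"` and rejects `"a"`.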

2.2 Outline of Algorithm 

Our result is based on a new and more compact encoding of small subautomata used within Myers'
algorithm [17] supporting constant time Move and Close operations. For our purposes and for
completeness, we restate Myers' algorithm in Sections 2.3 and 2.4 such that the dependency on
the Move and Close operations on subautomata is exposed. The new encoding is presented in
Section 2.5.

2.3 Decomposing the NFA 

In this section we show how to decompose N(R) into small subautomata. In the final algorithm 
transitions through these subautomata will be simulated in constant time. The decomposition is 
based on a clustering of the parse tree T(R). Our decomposition is similar to the one given in
[17, 32]. A cluster C is a connected subgraph of T(R). A cluster partition CS is a partition of
the nodes of T(R) into node-disjoint clusters. Since T(R) is a binary tree, a bottom-up procedure 
yields the following lemma. 

Lemma 2 For any regular expression R of size m and a parameter x, it is possible to build a
cluster partition CS of T(R), such that $|CS| = O(m/x)$ and for any $C \in CS$ the number of nodes
in C is at most x.

An example clustering of a parse tree is shown in Fig. 2(b).
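The bottom-up procedure is not spelled out in the text. One greedy possibility, given here purely as an assumption, is to grow an open cluster upwards from the leaves and to close the largest child cluster whenever adding the current node would exceed x. Every cluster closed this way has at least x/2 nodes, so at most 2m/x + 1 clusters are produced. The function name and the dictionary-based tree format are illustrative.

```python
def cluster_partition(children, root, x):
    """Greedy bottom-up clustering of a rooted tree with at most 2 children per node.
    children maps a node to the list of its children; x >= 1 is the size bound.
    Returns a dict mapping every node to a cluster id."""
    open_members = {}                # node -> nodes of the still-open cluster rooted there
    cluster_of = {}
    next_id = 0

    def close_cluster(members):
        nonlocal next_id
        for u in members:
            cluster_of[u] = next_id
        next_id += 1

    # Visit order in which every node appears before its descendants.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    for v in reversed(order):        # children are processed before their parent
        parts = sorted((open_members.pop(c) for c in children.get(v, [])), key=len)
        while 1 + sum(len(p) for p in parts) > x:
            close_cluster(parts.pop())          # close the largest open child cluster
        open_members[v] = [v] + [u for p in parts for u in p]

    close_cluster(open_members.pop(root))
    return cluster_of
```

On the 8-node parse tree of $ac|a^*b$ with x = 3, this sketch yields three clusters of sizes 3, 3 and 2; it need not coincide with the clustering drawn in the figure.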

Before proceeding, we need some definitions. Assume that CS is a cluster partition of T(R) for
some yet-to-be-determined parameter x. Edges adjacent to two clusters are external edges and
all other edges are internal edges. Contracting all internal edges induces a macro tree, where each
cluster is represented by a single macro node. Let $C_v$ and $C_w$ be two clusters with corresponding
macro nodes v and w. We say that $C_v$ is a parent cluster (resp. child cluster) of $C_w$ if v is the
parent (resp. child) of w in the macro tree. The root cluster and leaf clusters are the clusters
corresponding to the root and the leaves of the macro tree.

Next we show how to decompose N(R) into small subautomata. Each cluster C will correspond
to a subautomaton A and we use the terms child, parent, root, and leaf for subautomata in the
same way we do with clusters. For a cluster C, we insert a special pseudo-node $p_i$ for each child
cluster $C_1, \ldots, C_\ell$ in the middle of the external edge connecting C and $C_i$. Now, C's subautomaton
A is the automaton corresponding to the parse tree induced by the set of nodes $V(C) \cup \{p_1, \ldots, p_\ell\}$.
The pseudo-nodes are alphabet placeholders, since the leaves of a well-formed parse tree must be
characters.

In A, child automaton $A_i$ is represented by its start and accepting node $\theta_{A_i}$ and $\phi_{A_i}$ and a
pseudo-edge connecting them. An example of these definitions is given in Fig. 2. Any cluster
C of size at most x has fewer than 2x pseudo-children and therefore the size of the corresponding
subautomaton is at most 6x. Note, therefore, that automata derived from regular expressions can
be thus decomposed into $O(m/z)$ subautomata each of size at most z, by Lemma 2 and the above
construction.

Figure 2: (a) The parse tree for the regular expression $ac|a^*b$. (b) A clustering of (a) into
node-disjoint connected subtrees $C_1$, $C_2$, and $C_3$. Here, x = 3. (c) The clustering from (b) extended with
pseudo-nodes. (d) The automaton for the parse tree divided into subautomata corresponding to
the clustering. (e) The subautomaton $A_1$ with pseudo-edges corresponding to the child automata.

2.4 Simulating the NFA 

In this section we show how to do a node-set simulation of N(R) using the subautomata. We
compactly represent the node-set of each subautomaton in a bit string and in the next section we
will show how to manipulate these node-sets efficiently using a combination of the Four Russian
Technique and standard word operations. This approach is often called bit-parallelism [2].

Recall that each subautomaton has size less than z. Topologically sort all nodes in each
subautomaton A ignoring back edges. This can be done for all subautomata in total $O(m)$ time.
We represent the current node-set S of N(R) compactly using a bitvector for each subautomaton.
Specifically, for each subautomaton A we store a characteristic bitvector $B = [b_1, \ldots, b_z]$, where
nodes in B are indexed by their topological order, such that $B[i] = 1$ iff the ith node is in S. If
A contains fewer than z nodes we leave the remaining values undefined. For simplicity, we will refer
to the state of A as the node-set represented by the characteristic vector stored at A. Similarly,
the state of N(R) is the set of characteristic vectors representing S. The state of a node is the bit
indicating if the node is in S. Since A and any child A' of A overlap at the nodes $\theta_{A'}$ and $\phi_{A'}$ we will
ensure that the state of $\theta_{A'}$ and $\phi_{A'}$ is the same in the characteristic vectors of both A and A'.

Below we present appropriate move and $\epsilon$-closure operations defined on subautomata. Due
to the overlap between parent and child nodes these operations take a bit b which we will use to
propagate the new state of the start node. For each subautomaton A, characteristic vector B, bit
b, and character $a \in \Sigma$, define:

$\mathrm{Move}^A(B, b, a)$: Compute the state B' of all nodes in A reachable via a single a-edge from B. If
b = 0, return B', else return $B' \cup \{\theta_A\}$.

$\mathrm{Close}^A(B, b)$: Return the set B' of all nodes in A reachable via a path of 0 or more $\epsilon$-edges from
B, if b = 0, or reachable from $B \cup \{\theta_A\}$, if b = 1.

We will later show how to implement these operations in constant time and total $2^{O(k)}$ space when
$z = O(k)$. Before doing so we show how to use these operations to perform the node-set simulation
of N(R). Assume that the current node-set of N(R) is represented by its characteristic vector for
each subautomaton. The following Move and Close operations recursively traverse the hierarchy of
subautomata top-down. At each subautomaton the current state of N(R) is modified using primarily
$\mathrm{Move}^A$ and $\mathrm{Close}^A$. For any subautomaton A, bit b, and character $a \in \Sigma$ define:

Move(A, b, a): Let B be the current state of A and let $A_1, \ldots, A_\ell$ be the children of A in topological
order of their start node.

1. Compute $B' := \mathrm{Move}^A(B, b, a)$.

2. For each $A_i$, $1 \le i \le \ell$,

(a) Compute $f_i := \mathrm{Move}(A_i, b_i, a)$, where $b_i = 1$ iff $\theta_{A_i} \in B'$.

(b) If $f_i = 1$ set $B' := B' \cup \{\phi_{A_i}\}$.

3. Store B' and return the value 1 if $\phi_A \in B'$ and 0 otherwise.

Close(A, b): Let B be the current state of A and let $A_1, \ldots, A_\ell$ be the children of A in topological
order of their start node.

1. Compute $B' := \mathrm{Close}^A(B, b)$.

2. For each child automaton $A_i$, $1 \le i \le \ell$,

(a) Compute $f_i := \mathrm{Close}(A_i, b_i)$, where $b_i = 1$ iff $\theta_{A_i} \in B'$.

(b) If $f_i = 1$ set $B' := B' \cup \{\phi_{A_i}\}$.

(c) $B' := \mathrm{Close}^A(B', b)$.

3. Store B' and return the value 1 if $\phi_A \in B'$ and 0 otherwise.

The "store" in line 3 of both operations updates the state of the subautomaton. The node-set
simulation of N(R) on a string Q of length n produces the states $S_0, \ldots, S_n$ as follows. Let $A_r$ be
the root automaton. Initialize the state of N(R) to be empty, i.e., set all bitvectors to 0. $S_0$ is
computed by calling $\mathrm{Close}(A_r, 1)$ twice. Assume that $S_{j-1}$, $1 \le j \le n$, is the current state of N(R)
and let $a = Q[j]$. Compute $S_j$ by calling $\mathrm{Move}(A_r, 0, a)$ and then calling $\mathrm{Close}(A_r, 0)$ twice. Finally,
$Q \in L(R)$ iff $\phi \in S_n$.

We argue that the above algorithm is correct. To do this we need to show that the call to the
Move operation and the two calls to the Close operation simulate the standard Move and Close
operations.

First consider the Move operation. Let S be the state of N(R) and let S' be the state after
a call to $\mathrm{Move}(A_r, 0, a)$. Consider any subautomaton A and let B and B' be the bitvectors of A
corresponding to states S and S', respectively. We first show by induction that after Move(A, 0, a)
the new state B' is the set of nodes reachable from B via a single a-edge in N(R). For Move(A, 1, a)
a similar argument shows that the new state is the union of the set of nodes reachable from B via a
single a-edge and $\{\theta_A\}$.

Initially, we compute $B' := \mathrm{Move}^A(B, 0, a)$. Thus B' contains the set of nodes reachable via
a single a-edge in A. If A is a leaf automaton then B' satisfies the property and the algorithm
returns. Otherwise, there may be an a-edge to some accepting node $\phi_{A_i}$ of a child automaton $A_i$.
Since this edge is not contained in A, $\phi_{A_i}$ is not initially in B'. However, since each child is handled
recursively in topological order and the new states of start and accepting nodes are propagated, it
follows that $\phi_{A_i}$ is ultimately added to B'. Note that since a single node can be the accepting node
of a child $A_i$ and the start node of child $A_{i+1}$, the topological order is needed to ensure a consistent
update of the state.

It now follows that the state S' of N(R) after $\mathrm{Move}(A_r, 0, a)$ consists of all nodes reachable via
a single a-edge from S. Hence, $\mathrm{Move}(A_r, 0, a)$ correctly simulates a standard Move operation.

Next consider the two calls to the Close operation. Let S be the state of N(R) and let S' be
the state after the first call to $\mathrm{Close}(A_r, 0)$. As above consider any subautomaton A and let B and
B' be the bitvectors of A corresponding to S and S', respectively. We show by induction that after
Close(A, 0) the state B' contains the set of nodes in N(R) reachable via a path of 0 or more forward
$\epsilon$-edges from B. Initially $B' := \mathrm{Close}^A(B, 0)$, and hence B' contains all nodes reachable via a path
of 0 or more $\epsilon$-edges from B, where the path consists solely of edges in A. If A is a leaf automaton,
the result immediately holds. Otherwise, there may be a path of $\epsilon$-edges to a node v going through
the children of A. As above, the recursive topological processing of the children ensures that v is
added to B'.

Hence, after the first call to $\mathrm{Close}(A_r, 0)$ the state S' contains all nodes reachable from S via
a path of 0 or more forward $\epsilon$-edges. By a similar argument it follows that the second call to
$\mathrm{Close}(A_r, 0)$ produces the state S'' that contains all the nodes reachable from S via a path of
0 or more forward $\epsilon$-edges and 1 back edge. However, by Lemma 1 this is exactly the set of nodes
reachable via a path of 0 or more $\epsilon$-edges. Furthermore, since $\mathrm{Close}(A_r, 0)$ never produces a state
with nodes that are not reachable through $\epsilon$-edges, it follows that the two calls to $\mathrm{Close}(A_r, 0)$
correctly simulate a standard Close operation.

Finally, note that if we start with a state with no nodes, we can compute the state $S_0$ in
the node-set simulation by calling $\mathrm{Close}(A_r, 1)$ twice. Hence, the above algorithm correctly solves
Regular Expression Matching.

If the subautomata have size at most z and $\mathrm{Move}^A$ and $\mathrm{Close}^A$ can be computed in constant time
the above algorithm computes a step in the node-set simulation in $O(m/z)$ time. In the following
section we show how to do this in $O(2^k)$ space for $z = \Theta(k)$. Note that computing the clustering
uses an additional $O(m)$ time and space.

2.5 Representing Subautomata 

To efficiently represent $\mathrm{Move}^A$ and $\mathrm{Close}^A$ we apply the Four Russian trick. Consider a
straightforward code for $\mathrm{Move}^A$: Precompute the value of $\mathrm{Move}^A$ for all B, both values of b, and all characters
a. Since the number of different bitvectors is $2^z$ and the size of the alphabet is $\sigma$, this table has
$2^{z+1}\sigma$ entries. Each entry can be stored in a single word, so the table also uses a total of $2^{z+1}\sigma$
space. The total number of subautomata is $O(m/z)$, and therefore the total size of these tables is
an unacceptable $O(\frac{m}{z} \cdot 2^{z}\sigma)$.

To improve this we use a more elaborate approach. First we factor out the dependency on the
alphabet, as follows. For all subautomata A and all characters $a \in \Sigma$ define:

$\mathrm{Succ}^A(B)$: Return the set of all nodes in A reachable from B by a single edge.
$\mathrm{Eq}^A(a)$: Return the set of all a-nodes in A.

Since all incoming edges to a node are labeled with the same character it follows that,

$$\mathrm{Move}^A(B, b, a) = \begin{cases} \mathrm{Succ}^A(B) \cap \mathrm{Eq}^A(a) & \text{if } b = 0, \\ (\mathrm{Succ}^A(B) \cap \mathrm{Eq}^A(a)) \cup \{\theta_A\} & \text{if } b = 1. \end{cases}$$



Hence, given $\mathrm{Succ}^A$ and $\mathrm{Eq}^A$ we can implement $\mathrm{Move}^A$ in constant time using bit operations. To
efficiently represent $\mathrm{Eq}^A$, for each subautomaton A, store the value of $\mathrm{Eq}^A(a)$ in a hash table. Since
the total number of different characters in A is at most z the hash table $\mathrm{Eq}^A$ contains at most z
entries. Hence, we can represent $\mathrm{Eq}^A$ for all subautomata in $O(m)$ space and with constant worst-case
lookup time. The preprocessing time is $O(m)$ w.h.p. To get a worst-case preprocessing bound we
use the deterministic dictionary of [11] with $O(m \log m)$ worst-case preprocessing time.

We note that the idea of using $\mathrm{Eq}^A(a)$ to represent the a-nodes is not new and has been used
in several string matching algorithms, for instance, in the classical Shift-Or algorithm [2] and in
the recent optimized DFA construction for regular expression matching [23].
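The factorization of the move operation into a successor table and a per-character mask can be sketched as follows for a single small subautomaton. This is an illustration under assumptions: nodes are numbered 0..z-1, bitvectors are Python integers, a plain dictionary stands in for the hash table, and the names `build_tables` and `move_a` are hypothetical. It does not show the sharing of one successor table among subautomata with the same unlabeled structure.

```python
def build_tables(z, edges):
    """edges: (u, label, v) triples over nodes 0..z-1; label None is an epsilon-edge.
    Returns (succ, eq): succ[B] is the bitmask of nodes one edge away from the
    node-set B, and eq[a] is the bitmask of a-nodes."""
    single = [0] * z                 # single[u]: nodes one edge away from u
    eq = {}
    for u, label, v in edges:
        single[u] |= 1 << v
        if label is not None:
            eq[label] = eq.get(label, 0) | (1 << v)
    succ = [0] * (1 << z)
    for B in range(1, 1 << z):
        low = B & -B                 # lowest set bit of B
        succ[B] = succ[B ^ low] | single[low.bit_length() - 1]
    return succ, eq

def move_a(succ, eq, theta, B, b, a):
    """Succ(B) intersected with Eq(a), plus the start node theta when b = 1."""
    result = succ[B] & eq.get(a, 0)
    return result | (1 << theta) if b else result
```

The table has $2^z$ entries regardless of the alphabet; the character enters only through the mask lookup, which is the point of the factorization.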

To represent Succ compactly we proceed as follows. Let $\bar{A}$ be the automaton obtained by
removing the labels from edges in A. $\mathrm{Succ}^{A_1}$ and $\mathrm{Succ}^{A_2}$ compute the same function if $\bar{A}_1 = \bar{A}_2$.
Hence, to represent Succ it suffices to precompute Succ on all possible subautomata $\bar{A}$. By the
one-to-one correspondence of parse trees and automata we have that each subautomaton $\bar{A}$ corresponds
to a parse tree with leaf labels removed. Each such parse tree has at most x internal nodes and 2x
leaves. The number of rooted, ordered, binary trees with at most 3x nodes is less than $2^{6x+1}$, and
for each such tree each internal node can have one of 3 different labels. Hence, the total number of
distinct subautomata is less than $2^{6x+1}3^{x}$. Each subautomaton has at most 6x nodes and therefore
the result of $\mathrm{Succ}^A$ has to be computed for each of the $2^{6x}$ different values for B using $O(x2^{6x})$
time. Therefore we can precompute all values of Succ in $O(x2^{12x+1}3^{x})$ time. Choosing x such that
$x + \frac{\log x}{12 + \log 3} \le \frac{k-1}{12 + \log 3}$ gives us $O(2^k)$ space and preprocessing time.
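The condition on x follows by taking base-2 logarithms of the requirement that the precomputation bound $x2^{12x+1}3^{x}$ be at most $2^k$. A short derivation, written out here for convenience:

```latex
\begin{align*}
x\,2^{12x+1}\,3^{x} \le 2^{k}
  &\iff \log x + 12x + 1 + x\log 3 \le k \\
  &\iff x\,(12 + \log 3) + \log x \le k - 1 \\
  &\iff x + \frac{\log x}{12 + \log 3} \le \frac{k-1}{12 + \log 3}.
\end{align*}
```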

Using an analogous argument, it follows that $\mathrm{Close}^A$ can be precomputed for all distinct
subautomata within the same complexity. By our discussion in the previous sections and since $x = \Theta(k)$
we have shown the following theorem:

Theorem 1 For regular expression R of length m, string Q of length n, and $k \le w$, Regular
Expression Matching can be solved in $O(mn/k + n + m\log m)$ time and $O(2^k + m)$ space.



3 Approximate Regular Expression Matching 

Given a string Q, a regular expression R, and an integer $d \ge 0$, the Approximate Regular
Expression Matching problem is to determine if Q is within edit distance d to a string in
L(R). In this section we extend our solution for Regular Expression Matching to Approximate
Regular Expression Matching. Specifically, we show that the problem can be solved
in $O(\frac{mn\log(d+2)}{k} + n + m\log m)$ time and $O(2^k + m)$ space, for any $k \le w$.

Our result is achieved through a new encoding of subautomata within an algorithm by Wu et
al. [32] in a style similar to the above result for Regular Expression Matching. For
completeness we restate the algorithm of Wu et al. [32] in Sections 3.1 and 3.2. The new encoding is given
in Section 3.3.






3.1 Dynamic Programming Recurrence 

Our algorithm is based on a dynamic programming recurrence due to Myers and Miller [18], which
we describe below. Let $\Delta(v, i)$ denote the minimum over all paths P between $\theta$ and v of the edit
distance between P and the ith prefix of Q. The recurrence avoids cyclic dependencies from the
back edges by splitting the recurrence into two passes. Intuitively, the first pass handles forward
edges and the second pass propagates values from back edges. The pass-1 value of v is denoted
$\Delta_1(v, i)$, and the pass-2 value is $\Delta_2(v, i)$. For a given i, the pass-1 (resp. pass-2) value of N(R) is
the set of pass-1 (resp. pass-2) values of all nodes of N(R). For all v and i, we set $\Delta(v, i) = \Delta_2(v, i)$.

The set of predecessors of v is the set of nodes $\mathrm{Pre}(v) = \{w \mid (w, v) \text{ is an edge}\}$. We define
$\overline{\mathrm{Pre}}(v) = \{w \mid (w, v) \text{ is a forward edge}\}$. For notational convenience, we extend the definitions of
$\Delta_1$ and $\Delta_2$ to apply to sets, as follows: $\Delta_1(\mathrm{Pre}(v), i) = \min_{w \in \mathrm{Pre}(v)} \Delta_1(w, i)$ and
$\Delta_1(\overline{\mathrm{Pre}}(v), i) = \min_{w \in \overline{\mathrm{Pre}}(v)} \Delta_1(w, i)$, and analogously for $\Delta_2$. The pass-1 and pass-2 values
satisfy the following recurrence:



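$$\Delta_2(\theta, i) = \Delta_1(\theta, i) = i, \qquad 0 \le i \le n.$$

$$\Delta_2(v, 0) = \Delta_1(v, 0) = \begin{cases} \Delta_2(\overline{\mathrm{Pre}}(v), 0) + 1 & \text{if } v \text{ is a } \Sigma\text{-node}, \\ \Delta_2(\overline{\mathrm{Pre}}(v), 0) & \text{if } v \ne \theta \text{ is an } \epsilon\text{-node}. \end{cases}$$

For $1 \le i \le n$,

$$\Delta_1(v, i) = \begin{cases} \min(\Delta_2(v, i-1) + 1,\ \Delta_2(\mathrm{Pre}(v), i-1) + \lambda(v, Q[i]),\ \Delta_1(\overline{\mathrm{Pre}}(v), i) + 1) & \text{if } v \text{ is a } \Sigma\text{-node}, \\ \Delta_1(\overline{\mathrm{Pre}}(v), i) & \text{if } v \ne \theta \text{ is an } \epsilon\text{-node}, \end{cases}$$

where $\lambda(v, Q[i]) = 0$ if v is a $Q[i]$-node and 1 otherwise,

$$\Delta_2(v, i) = \begin{cases} \min(\Delta_1(\mathrm{Pre}(v), i),\ \Delta_2(\overline{\mathrm{Pre}}(v), i)) + 1 & \text{if } v \text{ is a } \Sigma\text{-node}, \\ \min(\Delta_1(\mathrm{Pre}(v), i),\ \Delta_2(\overline{\mathrm{Pre}}(v), i)) & \text{if } v \text{ is an } \epsilon\text{-node}. \end{cases}$$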



A full proof of the correctness of the above recurrence can be found in [18, 32]. Intuitively, the
first pass handles forward edges as follows: For $\Sigma$-nodes the recurrence handles insertions,
substitution/matches, and deletions (in this order). For $\epsilon$-nodes the values computed so far are propagated.
Subsequently, the second pass handles the back edges. For our problem we want to determine if Q
is within edit distance d. Hence, we can replace all values exceeding d by d + 1.
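One consequence of capping is that every value lies in the range 0 to d + 1 and therefore fits in $\lceil \log(d+2) \rceil$ bits, so several values can share one machine word. The following sketch illustrates the capping and packing only; the function names and the little-endian field layout are assumptions made for illustration.

```python
def pack(values, d):
    """Cap each value at d + 1 and pack the results into one integer,
    using ceil(log2(d + 2)) bits per value."""
    width = (d + 1).bit_length()          # equals ceil(log2(d + 2)) for d >= 0
    word = 0
    for i, v in enumerate(values):
        word |= min(v, d + 1) << (i * width)
    return word, width

def unpack(word, width, count):
    """Recover the count capped values stored in word."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]
```

With d = 3, each field is 3 bits wide and the values 5 and 9 are both stored as 4, which is enough to record that they exceed the threshold.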



3.2 Simulating the Recurrence 

Our algorithm now proceeds analogously to the case with d = 0 above. We will decompose the
automaton into subautomata, and we will compute the above dynamic program on an appropriate
encoding of the subautomata, leading to a small-space speedup.

As before, we decompose N(R) into subautomata of size less than z. For a subautomaton A
we define operations $\mathrm{Next}_1^A$ and $\mathrm{Next}_2^A$ which we use to compute the pass-1 and pass-2 values of A,
respectively. However, the new (pass-1 or pass-2) value of A depends on pseudo-edges in a more
complicated way than before: If A' is a child of A, then all nodes succeeding $\phi_{A'}$ depend on the
value of $\phi_{A'}$. Hence, we need the value of $\phi_{A'}$ before we can compute values of the nodes succeeding
$\phi_{A'}$. To address this problem we partition the nodes of a subautomaton as described below.

For each subautomaton A topologically sort the nodes (ignoring back edges) with the
requirement that for each child A' the start and accepting nodes $\theta_{A'}$ and $\phi_{A'}$ are consecutive in the order.
By contracting all pseudo-edges in A this can be done for all subautomata in $O(m)$ time. Let $A_1, \ldots, A_\ell$
be the children of A in this order. We partition the nodes in A, except $\{\theta_A\} \cup \{\phi_{A_1}, \ldots, \phi_{A_\ell}\}$, into
$\ell + 1$ chunks. The first chunk is the set of nodes in the interval $[\theta_A + 1, \theta_{A_1}]$. If we let $\theta_{A_{\ell+1}} = \phi_A$, then
the ith chunk, $1 < i \le \ell + 1$, is the set of nodes in the interval $[\phi_{A_{i-1}} + 1, \theta_{A_i}]$. A leaf automaton
has a single chunk consisting of all nodes except the start node. We represent the ith chunk in A
by a characteristic vector $L_i$ identifying the nodes in the chunk, that is, $L_i[j] = 1$ if node j is in
the ith chunk and 0 otherwise. From the topological order we can compute all chunks and their
corresponding characteristic vectors in total $O(m)$ time.

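The value of A is represented by a vector $B = [b_1, \ldots, b_z]$, such that $b_i \in [0, d + 1]$. Hence,
the total number of bits used to encode B is $z\lceil \log(d + 2) \rceil$. For an automaton A, vectors
B and L, and a character $a \in \Sigma$ define the operations $\mathrm{Next}_1^A(B, L, a)$ and $\mathrm{Next}_2^A(B, L)$
as the vectors $B_1$ and $B_2$, respectively, given by:

$$B_1[v] = \begin{cases} B[v] & \text{if } v \notin L, \\ \min(B[v] + 1,\ B[\mathrm{Pre}(v)] + \lambda(v, a),\ B_1[\overline{\mathrm{Pre}}(v)] + 1) & \text{if } v \in L \text{ is a } \Sigma\text{-node}, \\ B_1[\overline{\mathrm{Pre}}(v)] & \text{if } v \in L \text{ is an } \epsilon\text{-node}, \end{cases}$$

$$B_2[v] = \begin{cases} B[v] & \text{if } v \notin L, \\ \min(B[\mathrm{Pre}(v)],\ B_2[\overline{\mathrm{Pre}}(v)]) + 1 & \text{if } v \in L \text{ is a } \Sigma\text{-node}, \\ \min(B[\mathrm{Pre}(v)],\ B_2[\overline{\mathrm{Pre}}(v)]) & \text{if } v \in L \text{ is an } \epsilon\text{-node}. \end{cases}$$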

Importantly, note that the operations only affect the nodes in the chunk specified by L. We will use
this below to compute new values of A by advancing one chunk at each step. We use the following
recursive operations. For subautomaton A, integer b, and character a define:

$\mathrm{Next}_1(A, b, a)$: Let B be the current value of A and let $A_1, \ldots, A_\ell$ be the children of A in topological
order of their start node.

1. Set $B_1 := B$ and $B_1[\theta_A] := b$.

2. For each chunk $L_i$, $1 \le i \le \ell$,

(a) Compute $B_1 := \mathrm{Next}_1^A(B_1, L_i, a)$.

(b) Compute $f_i := \mathrm{Next}_1(A_i, B_1[\theta_{A_i}], a)$.

(c) Set $B_1[\phi_{A_i}] := f_i$.

3. Compute $B_1 := \mathrm{Next}_1^A(B_1, L_{\ell+1}, a)$.

4. Return $B_1[\phi_A]$.

$\mathrm{Next}_2(A, b)$: Let $B$ be the current value of $A$ and let $A_1, \ldots, A_\ell$ be the
children of $A$ in topological order of their start node.

1. Set $B_2 := B$ and $B_2[\theta_A] := b$.

2. For each chunk $L_i$, $1 \le i \le \ell$,

(a) Compute $B_2 := \mathrm{Next}_2^A(B_2, L_i)$.

(b) Compute $f_i := \mathrm{Next}_2(A_i, B_2[\theta_{A_i}])$.

(c) Set $B_2[\phi_{A_i}] := f_i$.

3. Compute $B_2 := \mathrm{Next}_2^A(B_2, L_{\ell+1})$.

4. Return $B_2[\phi_A]$.

The simulation of the dynamic programming recurrence on a string $Q$ of length $n$ proceeds as
follows: First encode the initial values of all nodes in $N(R)$ using the recurrence. Let $A_r$ be
the root automaton, let $S_{j-1}$ be the current value of $N(R)$, and let $\alpha = Q[j]$. Compute the
next value $S_j$ by calling $\mathrm{Next}_1(A_r, j, \alpha)$ and then $\mathrm{Next}_2(A_r, j)$.
Finally, if the value of $\phi$ in the pass-2 value of $S_n$ is less than $d$, report a match.

To see the correctness, we need to show that the $\mathrm{Next}_1$ and $\mathrm{Next}_2$ operations
correctly compute the pass-1 and pass-2 values of $N(R)$. First consider $\mathrm{Next}_1$, and let
$A$ be any subautomaton. The key property is that if $p_1$ is the pass-1 value of $\theta_A$, then
after a call to $\mathrm{Next}_1(A, p_1, \alpha)$ the value of $A$ is correctly updated to the pass-1
value. This follows by a straightforward induction similar to the exact case. Since the pass-1
value of $\theta$ after reading the $j$th prefix of $Q$ is $j$, the correctness of the call to
$\mathrm{Next}_1$ follows. For $\mathrm{Next}_2$ the result follows by an analogous argument.

3.3 Representing Subautomata 

Next we show how to efficiently represent $\mathrm{Next}_1^A$ and $\mathrm{Next}_2^A$. First consider
$\mathrm{Next}_1^A$. Note that again the alphabet size is a problem. Since the $B_1$ value of a node
in $A$ depends on other $B_1$ values in $A$ we cannot "split" the computation of $\mathrm{Next}_1^A$
as before. However, the alphabet character only affects the value of $\lambda(v, \alpha)$, which is 1
if $v$ is an $\alpha$-node and 0 otherwise. Hence, we can represent $\lambda(v, \alpha)$ for all nodes
in $A$ with $\mathrm{Eq}_A(\alpha)$ from the previous section. Recall that $\mathrm{Eq}_A(\alpha)$ can
be represented for all subautomata in total $O(m)$ space. With this representation the total number
of possible inputs to $\mathrm{Next}_1^A$ can be represented using $(d + 2)^z + 2^{2z}$ bits. Note
that for $z = \frac{k}{\log(d + 2)}$ we have that $(d + 2)^z = 2^k$. Furthermore, since
$\mathrm{Next}_1^A$ is now alphabet independent we can apply the same trick as before and only
precompute it for all possible parse trees with leaf labels removed. It follows that we can choose
$z = \Theta(\frac{k}{\log(d + 2)})$ such that $\mathrm{Next}_1^A$ can be precomputed in total
$O(2^k)$ time and space. An analogous argument applies to $\mathrm{Next}_2^A$. Hence, by our
discussion in the previous sections we have shown that,

Theorem 2 For regular expression $R$ of length $m$, string $Q$ of length $n$, and integer $d \ge 0$,
Approximate Regular Expression Matching can be solved in
$O(\frac{mn \log(d + 2)}{k} + n + m \log m)$ time and $O(2^k + m)$ space, for any $k \le w$.

4 String Edit Distance 

The String Edit Distance problem is to compute the minimum number of edit operations
needed to transform a string $S$ into a string $T$. Let $m$ and $n$ be the size of $S$ and $T$,
respectively. The classical solution to this problem, due to Wagner and Fischer [29], fills in the
entries of an $(m + 1) \times (n + 1)$ matrix $D$. The entry $D_{i,j}$ is the edit distance between
$S[1..i]$ and $T[1..j]$, and can be computed using the following recursion:

$$
\begin{aligned}
D_{i,0} &= i \\
D_{0,j} &= j \\
D_{i,j} &= \min\{D_{i-1,j-1} + \lambda(i, j),\; D_{i-1,j} + 1,\; D_{i,j-1} + 1\}
\end{aligned}
$$

where $\lambda(i, j) = 0$ if $S[i] = T[j]$ and 1 otherwise. The edit distance between $S$ and $T$ is
the entry $D_{m,n}$. Using dynamic programming the problem can be solved in $O(mn)$ time. When filling
out the matrix we only need to store the previous row or column and hence the space used is
$O(\min(m, n))$. For further details, see the book by Gusfield [10, Chap. 11].
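The recursion and the row-by-row space saving can be sketched as follows. This is a minimal
illustration of the classical dynamic program, not of the tabulated algorithm discussed in the rest
of the section; the function name is our own.

```python
def edit_distance(s, t):
    """Edit distance via the classical recursion, storing only one previous row."""
    if len(t) > len(s):
        s, t = t, s                         # keep the stored row as short as possible
    prev = list(range(len(t) + 1))          # row 0: D[0][j] = j
    for i in range(1, len(s) + 1):
        cur = [i]                           # column 0: D[i][0] = i
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1    # lambda(i, j)
            cur.append(min(prev[j - 1] + cost,         # match or substitution
                           prev[j] + 1,                # deletion
                           cur[j - 1] + 1))            # insertion
        prev = cur
    return prev[-1]
```

Swapping the arguments is safe because the edit distance is symmetric, and it keeps the stored row
at length $\min(m, n) + 1$.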

The best algorithm for this problem, due to Masek and Paterson [14], improves the time to
$O(\frac{mn}{k^2} + m + n)$ using $O(2^k + \min(m, n))$ space, for any $k \le w$. This algorithm,
however, assumes that the alphabet size is constant. In this section we give an algorithm using
$O(\frac{mn \log^2 k}{k^2} + m + n)$ time and $O(2^k + \min(m, n))$ space, for any $k \le w$, that
works for any alphabet. Hence, we remove the dependency on the alphabet at the cost of a
$\log^2 k$ factor.

We first describe the algorithm by Masek and Paterson [14], and then modify it to handle
arbitrary alphabets. The algorithm uses the Four Russian Technique. The matrix $D$ is divided
into cells of size $x \times x$ and all possible inputs of a cell are then precomputed and stored in
a table. From the above recursion it follows that the values inside each cell $C$ depend on the
corresponding substrings in $S$ and $T$, denoted $S_C$ and $T_C$, and on the values in the top row
and the leftmost column in $C$. The number of different strings of length $x$ is $\sigma^x$ and hence
there are $\sigma^{2x}$ possible choices for $S_C$ and $T_C$. Masek and Paterson [14] showed that
adjacent entries in $D$ differ by at most one, and therefore if we know the value of an entry there
are exactly three choices for each adjacent entry. Since there are at most $m$ different values for
the top left corner of a cell it follows that the number of different inputs for the top row and
the leftmost column is $m 3^{2x}$. In total, there are at most $m(3\sigma)^{2x}$ different inputs to a
cell. Assuming that the alphabet has constant size, we can choose $x = \Theta(k)$ such that all
cells can be precomputed in $O(2^k)$ time and space. The input of each cell is stored in a single
machine word and therefore all values in a cell can be computed in constant time. The total number
of cells in the matrix is $O(\frac{mn}{k^2})$ and hence this implies an algorithm using
$O(\frac{mn}{k^2} + m + n)$ time and $O(2^k + \min(m, n))$ space.
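The factor $3^{2x}$ in the count above comes from storing a row or column of a cell as one value
followed by steps in $\{-1, 0, +1\}$. The following sketch illustrates this encoding on a small
instance. The helper names are our own, and the full matrix is built only to have rows and columns
to encode; the tabulated algorithm never materializes it.

```python
def edit_matrix(s, t):
    """Full (len(s)+1) x (len(t)+1) edit distance matrix."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d


def encode_steps(values):
    """Split a row or column into its first value and a list of steps."""
    steps = [b - a for a, b in zip(values, values[1:])]
    # Adjacent entries differ by at most one, so three symbols per step suffice.
    assert all(step in (-1, 0, 1) for step in steps)
    return values[0], steps


def decode_steps(first, steps):
    """Inverse of encode_steps."""
    values = [first]
    for step in steps:
        values.append(values[-1] + step)
    return values
```

Each step fits in two bits, so the top row and leftmost column of an $x \times x$ cell take $O(x)$
bits in addition to the corner value.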

We show how to generalize this to arbitrary alphabets. The first observation, similar to the
idea in Section 3, is that the values inside a cell $C$ do not depend on the actual characters of
$S_C$ and $T_C$, but only on the $\lambda$ function on $S_C$ and $T_C$. Hence, we only need to encode
whether or not $S_C[i] = T_C[j]$ for all $1 \le i, j \le x$. To do this we assign a code $c(\alpha)$
to each character $\alpha$ that appears in $T_C$ or $S_C$ as follows. If $\alpha$ appears in only one
of $S_C$ or $T_C$ then $c(\alpha) = 0$. Otherwise, $c(\alpha)$ is the rank of $\alpha$ in the sorted
list of characters that appear in both $S_C$ and $T_C$. The representation is given by two vectors
$\bar{S}_C$ and $\bar{T}_C$ of size $x$, where $\bar{S}_C[i] = c(S_C[i])$ and
$\bar{T}_C[i] = c(T_C[i])$, for all $i$, $1 \le i \le x$. Clearly, $S_C[i] = T_C[j]$ iff
$\bar{S}_C[i] = \bar{T}_C[j]$ and $\bar{S}_C[i] > 0$ and $\bar{T}_C[j] > 0$, and hence $\bar{S}_C$ and
$\bar{T}_C$ suffice to represent $\lambda$ on $C$.
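A minimal sketch of this encoding for a single cell is given below. The function name is our own,
and the sketch uses Python sets to find the shared characters, which is a convenience of the
example and not a claim about how the codes should be computed within the stated time bounds.

```python
def encode_cell(sc, tc):
    """Code vectors for the two substrings of one cell.

    A character occurring in only one substring gets code 0; a character
    occurring in both gets its 1-based rank among the shared characters.
    """
    shared = sorted(set(sc) & set(tc))
    code = {ch: rank for rank, ch in enumerate(shared, start=1)}
    sc_codes = [code.get(ch, 0) for ch in sc]
    tc_codes = [code.get(ch, 0) for ch in tc]
    return sc_codes, tc_codes
```

For the substrings `"abz"` and `"bqa"` the shared characters are `a` and `b`, so the code vectors
are `[1, 2, 0]` and `[2, 0, 1]`. Two positions hold equal characters exactly when their codes are
equal and nonzero.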

The number of characters appearing in both $T_C$ and $S_C$ is at most $x$ and hence each entry
of the vectors is assigned an integer value in the range $[0, x]$. Thus, the total number of bits
needed for both vectors is $2x \lceil \log(x + 1) \rceil$. Hence, we can choose
$x = \Theta(\frac{k}{\log k})$ such that the input vectors for a cell can be represented in a single
machine word. The total number of cells becomes $O(\frac{nm}{x^2}) = O(\frac{nm \log^2 k}{k^2})$.
Hence, if the input vectors for each cell are available we can use the Four Russian Technique to
get an algorithm for String Edit Distance using $O(\frac{nm \log^2 k}{k^2} + m + n)$ time and
$O(2^k + \min(m, n))$ space as desired.

Space                 Preprocessing         Query
$O(n\sigma)$          $O(n\sigma)$          $O(m)$
$O(n \log \sigma)$    $O(n \log \sigma)$    $O(m \log \sigma)$
$O(n)$                $O(n)$                $O(m \log n)$

Table 1: Trade-offs for Subsequence Indexing.

Next we show how to compute vectors efficiently. Given any cell $C$, we can identify the characters
appearing in both $S_C$ and $T_C$ by sorting $S_C$ and then for each index $i$ in $T_C$ use a binary
search to see if $T_C[i]$ appears in $S_C$. Next we sort the characters appearing in both substrings
and insert their ranks into the corresponding positions in $\bar{S}_C$ and $\bar{T}_C$. All other
positions in the vectors are given the value 0. This algorithm uses $O(x \log x)$ time for each
cell. However, since the number of cells is $O(\frac{nm}{x^2})$ the total time becomes
$O(\frac{nm \log x}{x})$, which for our choice of $x$ is $O(\frac{nm \log^2 k}{k})$. To improve
this we group the cells into macro cells of $y \times y$ cells. We then compute the vector
representation for each of these macro cells. The vector representation for a cell $C$ is now the
corresponding subvectors of the macro cell containing $C$. Hence, each vector entry is now in the
range $[0, \ldots, xy]$ and thus uses $\lceil \log(xy + 1) \rceil$ bits. Computing the vector
representation uses $O(xy \log(xy))$ time for each macro cell and since the number of macro cells is
$O(\frac{nm}{(xy)^2})$ the total time to compute it is $O(\frac{nm \log(xy)}{xy} + m + n)$. It
follows that we can choose $y = k \log k$ and $x = \Theta(\frac{k}{\log k})$ such that vectors
for a cell can be represented in a single word. With this choice of $x$ and $y$ we have that
$xy = \Theta(k^2)$ and hence all vectors are computed in
$O(\frac{nm \log(xy)}{xy} + m + n) = O(\frac{nm \log k}{k^2} + m + n)$ time. Computing
the distance matrix dominates the total running time and hence we have shown:

Theorem 3 For strings $S$ and $T$ of length $m$ and $n$, respectively, String Edit Distance can be
solved in $O(\frac{mn \log^2 k}{k^2} + m + n)$ time and $O(2^k + \min(m, n))$ space.
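The vector computation for a macro cell, as used in the argument leading to Theorem 3, can be
sketched as follows. The sketch follows the sort-and-binary-search description but makes several
assumptions of its own: the function name, the use of Python lists rather than packed machine
words, and the input format (the two substrings covered by the macro cell together with the cell
side length `x`).

```python
from bisect import bisect_left


def encode_macro_cell(s_macro, t_macro, x):
    """Code vectors for a macro cell, split into subvectors of length x."""
    sorted_s = sorted(set(s_macro))
    # Characters of t_macro that also occur in s_macro, found by binary search.
    shared = []
    for ch in sorted(set(t_macro)):
        pos = bisect_left(sorted_s, ch)
        if pos < len(sorted_s) and sorted_s[pos] == ch:
            shared.append(ch)               # shared stays sorted

    def code(ch):
        # 1-based rank among the shared characters, or 0 if ch is not shared.
        pos = bisect_left(shared, ch)
        return pos + 1 if pos < len(shared) and shared[pos] == ch else 0

    def split(codes):
        # One subvector per cell of the macro cell.
        return [codes[i:i + x] for i in range(0, len(codes), x)]

    return (split([code(ch) for ch in s_macro]),
            split([code(ch) for ch in t_macro]))
```

A cell inside the macro cell then uses the subvectors at its row and column index, so its codes
are ranks with respect to the whole macro cell rather than the cell alone.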

5 Subsequence Indexing 

The Subsequence Indexing problem is to preprocess a string $T$ to build a data structure
supporting queries of the form: "is $Q$ a subsequence of $T$?" for any string $Q$. This problem was
considered by Baeza-Yates [3] who showed the trade-offs listed in Table 1. We assume throughout the
section that $T$ and $Q$ have length $n$ and $m$, respectively. For properties of automata accepting
subsequences of a string and generalizations of the problem see the recent survey [8].

Using recent data structures and a few observations we improve all previous bounds. As a
notational shorthand, we will say that a data structure with preprocessing time and space
$f(n, \sigma)$ and query time $g(m, n, \sigma)$ has complexity $(f(n, \sigma), g(m, n, \sigma))$.

Let us consider the simplest algorithm for Subsequence Indexing. One can build a DFA
of size $O(n\sigma)$ for recognizing all subsequences of $T$. To do so, create an accepting node for
each character of $T$, and for node $v_i$, corresponding to character $T[i]$, create an edge to $v_j$
on character $\alpha$ if $T[j]$ is the first $\alpha$ after position $i$. The start node has edges to
the first occurrence of each character. Such an automaton yields an algorithm with complexity
$(O(n\sigma), O(m))$.
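A minimal sketch of this automaton is given below. State 0 plays the role of the start node and
state $i \ge 1$ corresponds to $T[i]$; the function names and the use of one dictionary per state
are our own choices for the example.

```python
def build_automaton(text):
    """nxt[i][ch] = smallest position j > i (1-based) with T[j] == ch."""
    n = len(text)
    nxt = [None] * (n + 1)
    nxt[n] = {}
    for i in range(n - 1, -1, -1):
        row = dict(nxt[i + 1])      # copy costs O(sigma), so O(n * sigma) in total
        row[text[i]] = i + 1        # text[i] is T[i + 1] in 1-based indexing
        nxt[i] = row
    return nxt


def automaton_accepts(nxt, query):
    """True iff query is a subsequence of the indexed text."""
    state = 0
    for ch in query:
        state = nxt[state].get(ch)
        if state is None:           # no later occurrence of ch
            return False
    return True
```

Every state is accepting, so a query is rejected only when some character has no outgoing edge.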

An alternative is to build, for each character $\alpha$, a data structure $D_\alpha$ with the
positions of $\alpha$ in $T$. $D_\alpha$ should support fast successor queries. The $D_\alpha$'s can
all be built in a total of linear time and space using, for instance, van Emde Boas trees and
perfect hashing [27, 28, 15]. These trees have query time $O(\log \log n)$. We use these vEB trees to
simulate the above automaton-based algorithm: whenever we are in state $v_i$, and the next character
to be read from $Q$ is $\alpha$, we look up the successor of $i$ in $D_\alpha$ in $O(\log \log n)$
time. The complexity of this algorithm is $(O(n), O(m \log \log n))$.
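The simulation by successor queries can be sketched as follows. As an assumption of the example,
each $D_\alpha$ is a sorted Python list searched with binary search, which gives $O(\log n)$ per
query rather than the $O(\log \log n)$ of a van Emde Boas tree; the function names are our own.

```python
from bisect import bisect_right
from collections import defaultdict


def build_positions(text):
    """Map each character to the sorted list of its positions in text."""
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)     # appended in increasing order
    return positions


def is_subsequence(positions, query):
    """True iff query is a subsequence of the indexed text."""
    current = -1                    # position of the last matched character
    for ch in query:
        occ = positions.get(ch)
        if not occ:
            return False
        k = bisect_right(occ, current)      # successor of current among occ
        if k == len(occ):
            return False
        current = occ[k]
    return True
```

The index stores each position of the text once, so it uses linear space regardless of the
alphabet size.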

We combine these two data structures as follows: Consider an automaton consisting of nodes
$u_1, \ldots, u_{n/\sigma}$, where node $u_i$ corresponds to characters
$T[\sigma(i - 1), \ldots, \sigma i - 1]$, that is, each node $u_i$ corresponds to $\sigma$ nodes in
$T$. Within each such node, apply the vEB based data structure. Between such nodes, apply the full
automaton data structure. That is, for node $u_i$, compute the first occurrence of each character
$\alpha$ after $T[\sigma i - 1]$. Call these long jumps. An edge takes you to a node $u_j$, and as
many characters of $Q$ are consumed within $u_j$ as possible. When no valid edge is possible within
$u_j$, take a long jump. The automaton uses $O(\frac{n}{\sigma} \cdot \sigma) = O(n)$ space and
preprocessing time. The total size of the vEB data structures is $O(n)$. Since each $u_i$ consists of
at most $\sigma$ nodes, the time for each successor query is improved to $O(\log \log \sigma)$.
Hence, the complexity of this algorithm is $(O(n), O(m \log \log \sigma))$. To get a trade-off we can
replace the vEB data structures by a recent data structure of Thorup [26, Thm. 2]. This data
structure supports successor queries of $x$ integers in the range $[1, X]$ using
$O(xX^{1/2^\ell})$ preprocessing time and space with query time $O(\ell + 1)$, for
$0 \le \ell \le \log \log X$. Since each of the $n/\sigma$ groups of nodes contains at most $\sigma$
nodes, this implies the following result:



Theorem 4 Subsequence Indexing can be solved in $(O(n\sigma^{1/2^\ell}), O(m(\ell + 1)))$, for
$0 \le \ell \le \log \log \sigma$.

Corollary 1 Subsequence Indexing can be solved in $(O(n\sigma^\epsilon), O(m))$ or
$(O(n), O(m \log \log \sigma))$.

Proof. We set $\ell$ to be a constant or $\log \log \sigma$, respectively. □

We note that using a recent data structure for rank and select queries on large alphabets by
Golynski et al. [9] we can also immediately obtain an algorithm using time $O(m \log \log \sigma)$
and space $n \log \sigma + o(n \log \sigma)$ bits. Hence, this result matches our fastest algorithm
while improving the space from $O(n)$ words to the number of bits needed to store $T$.

6 Acknowledgments

Thanks to the anonymous reviewers for many detailed and insightful comments.

References

[1] V. L. Arlazarov, E. A. Dinic, M. A. Kronrod, and I. A. Faradzev. On economic construction
of the transitive closure of a directed graph (in Russian). Dokl. Acad. Nauk., 194:487-488, 1970.
English translation in Soviet Math. Dokl., 11:1209-1210, 1975.

[2] R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM,
35(10):74-82, 1992.

[3] R. A. Baeza-Yates. Searching subsequences. Theor. Comput. Sci., 78(2):363-376, 1991.

[4] R. A. Baeza-Yates and G. H. Gonnet. Fast text searching for regular expressions or automaton
searching on tries. J. ACM, 43(6):915-936, 1996.

[5] R. A. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica,
23(2):127-158, 1999.

[6] D. Barbosa, A. O. Mendelzon, L. Libkin, L. Mignet, and M. Arenas. Efficient incremental 
validation of XML documents. In Proceedings of the 20th International Conference on Data 
Engineering, page 671, 2004. 

[7] P. Bille. New algorithms for regular expression matching. In Proceedings of the 33rd
International Colloquium on Automata, Languages and Programming, pages 643-654, 2006.

[8] M. Crochemore, B. Melichar, and Z. Troníček. Directed acyclic subsequence graph: overview.
J. of Discrete Algorithms, 1(3-4):255-280, 2003.

[9] A. Golynski, J. I. Munro, and S. S. Rao. Rank/select operations on large alphabets: a tool 
for text indexing. In Proceedings of the 17th annual ACM-SIAM Symposium on Discrete 
Algorithms, pages 368-373, 2006. 

[10] D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational 
biology. Cambridge, 1997. 

[11] T. Hagerup, P. B. Miltersen, and R. Pagh. Deterministic dictionaries. J. of Algorithms,
41(1):69-85, 2001.

[12] H. Hosoya and B. Pierce. Regular expression pattern matching for XML. In Proceedings of the
28th ACM SIGPLAN-SIGACT symp. on Principles of programming languages (POPL), pages
67-80, 2001.

[13] Q. Li and B. Moon. Indexing and querying XML data for regular path expressions. In 
Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pages 
361-370, 2001. 

[14] W. Masek and M. Paterson. A faster algorithm for computing string edit distances. J. Comput. 
Syst. Sci., 20:18-31, 1980. 

[15] K. Mehlhorn and S. Näher. Bounded ordered dictionaries in O(log log N) time and O(n) space.
Inf. Process. Lett., 35(4):183-189, 1990.

[16] M. Murata. Extended path expressions of XML. In Proceedings of the twentieth ACM 
SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS), pages 
126-137, 2001. 

[17] E. W. Myers. A four-russian algorithm for regular expression pattern matching. J. of the 
ACM, 39(2):430-448, 1992. 

[18] E. W. Myers and W. Miller. Approximate matching of regular expressions. Bull. of Math.
Biology, 51:5-37, 1989.






[19] G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic 
programming. J. ACM, 46(3):395-415, 1999. 

[20] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31-88,
2001.

[21] G. Navarro. NR-grep: a fast and flexible pattern-matching tool. Software - Practice and 
Experience, 31(13):1265-1312, 2001. 

[22] G. Navarro. Approximate regular expression searching with arbitrary integer weights. Nordic
J. of Computing, 11(4):356-373, 2004.

[23] G. Navarro and M. Raffinot. New techniques for regular expression searching. Algorithmica, 
41(2):89-116, 2004. 

[24] R. M. Stallman. Emacs the extensible, customizable self-documenting display editor.
SIGPLAN Not., 16(6):147-156, 1981.

[25] K. Thompson. Regular expression search algorithm. Comm. of the ACM, 11:419-422, 1968. 

[26] M. Thorup. Space efficient dynamic stabbing with fast queries. In Proceedings of the symposium 
on Theory of computing (STOC), pages 649-658, 2003. 

[27] P. van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space.
Inf. Process. Lett., 6(3):80-82, 1977.

[28] P. van Emde Boas, R. Kaas, and E. Zijlstra. Design and implementation of an efficient priority 
queue. Mathematical Systems Theory, 10:99-127, 1977. 

[29] R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM,
21(1):168-173, 1974.

[30] S. Wu and U. Manber. Agrep - a fast approximate pattern-matching tool. In Proceedings 
USENIX Winter 1992 Technical Conference, pages 153-162, San Francisco, CA, 1992. 

[31] S. Wu and U. Manber. Fast text searching: allowing errors. Commun. ACM, 35(10):83-91, 
1992. 

[32] S. Wu, U. Manber, and E. W. Myers. A subquadratic algorithm for approximate regular 
expression matching. J. of Algorithms, 19(3):346-360, 1995. 






